The House Price Prediction project below applies several machine learning techniques to analyze the available data and predict house prices using regression models.
It starts with an exploratory data analysis (EDA) that examines the main characteristics of the dataset and the relations between the target variable and the independent ones. A location analysis also gives a sense of where the data was recorded. The EDA leads to a feature engineering phase where the data is cleaned and transformed so that it is ready to be used as input for the different models to be tested.
Finally, the data is split into a train set and a test set. Evaluation metrics are defined and four different regression techniques are applied in order to compare results and pick the best model to generate predictions for the test sample.
R provides all the necessary libraries (data.table, ggplot2, MASS, randomForest, among others) for loading the data and converting it to a data frame, for the exploratory data analysis with graphs that make this stage more visual, and for building the models.
A list of libraries is installed and then loaded so that they can be used throughout the project.
Two CSV files were provided to complete the project: house_price_train.csv and house_price_test.csv
library(data.table)  # provides fread()
train_data <- fread('house_price_train.csv', stringsAsFactors = FALSE)
cat('The Dimensions of the Train set are: ', dim(train_data))
## The Dimensions of the Train set are: 17277 21
str(train_data)
## Classes 'data.table' and 'data.frame': 17277 obs. of 21 variables:
## $ id :integer64 9183703376 464000600 2224079050 6163901283 6392003810 7974200948 2426059124 2115510300 ...
## $ date : chr "5/13/2014" "8/27/2014" "7/18/2014" "1/30/2015" ...
## $ price : num 225000 641250 810000 330000 530000 ...
## $ bedrooms : int 3 3 4 4 4 4 4 3 4 3 ...
## $ bathrooms : num 1.5 2.5 3.5 1.5 1.75 3.5 3.25 2.25 2.5 1.5 ...
## $ sqft_living : int 1250 2220 3980 1890 1814 3120 4160 1440 2250 2540 ...
## $ sqft_lot : int 7500 2550 209523 7540 5000 5086 47480 10500 6840 9520 ...
## $ floors : num 1 3 2 1 1 2 2 1 2 1 ...
## $ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
## $ view : int 0 2 2 0 0 0 0 0 0 0 ...
## $ condition : int 3 3 3 4 4 3 3 3 3 3 ...
## $ grade : int 7 10 9 7 7 9 10 8 9 8 ...
## $ sqft_above : int 1250 2220 3980 1890 944 2480 4160 1130 2250 1500 ...
## $ sqft_basement: int 0 0 0 0 870 640 0 310 0 1040 ...
## $ yr_built : int 1967 1990 2006 1967 1951 2008 1995 1983 1987 1959 ...
## $ yr_renovated : int 0 0 0 0 0 0 0 0 0 0 ...
## $ zipcode : int 98030 98117 98024 98155 98115 98115 98072 98023 98058 98115 ...
## $ lat : num 47.4 47.7 47.6 47.8 47.7 ...
## $ long : num -122 -122 -122 -122 -122 ...
## $ sqft_living15: int 1260 2200 2220 1890 1290 1880 3400 1510 2480 1870 ...
## $ sqft_lot15 : int 7563 5610 65775 8515 5000 5092 40428 8125 7386 6800 ...
## - attr(*, ".internal.selfref")=<externalptr>
summary(train_data)
## id date price
## Min. : 1000102 Length:17277 Min. : 78000
## 1st Qu.:2113701080 Class :character 1st Qu.: 320000
## Median :3902100205 Mode :character Median : 450000
## Mean :4566440237 Mean : 539865
## 3rd Qu.:7302900090 3rd Qu.: 645500
## Max. :9900000190 Max. :7700000
## bedrooms bathrooms sqft_living sqft_lot
## Min. : 1.000 Min. :0.500 Min. : 370 Min. : 520
## 1st Qu.: 3.000 1st Qu.:1.750 1st Qu.: 1430 1st Qu.: 5050
## Median : 3.000 Median :2.250 Median : 1910 Median : 7620
## Mean : 3.369 Mean :2.114 Mean : 2080 Mean : 15186
## 3rd Qu.: 4.000 3rd Qu.:2.500 3rd Qu.: 2550 3rd Qu.: 10695
## Max. :33.000 Max. :8.000 Max. :13540 Max. :1164794
## floors waterfront view condition
## Min. :1.000 Min. :0.000000 Min. :0.0000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:0.000000 1st Qu.:0.0000 1st Qu.:3.000
## Median :1.500 Median :0.000000 Median :0.0000 Median :3.000
## Mean :1.493 Mean :0.007467 Mean :0.2335 Mean :3.413
## 3rd Qu.:2.000 3rd Qu.:0.000000 3rd Qu.:0.0000 3rd Qu.:4.000
## Max. :3.500 Max. :1.000000 Max. :4.0000 Max. :5.000
## grade sqft_above sqft_basement yr_built
## Min. : 3.00 Min. : 370 Min. : 0.0 Min. :1900
## 1st Qu.: 7.00 1st Qu.:1190 1st Qu.: 0.0 1st Qu.:1951
## Median : 7.00 Median :1564 Median : 0.0 Median :1975
## Mean : 7.66 Mean :1791 Mean : 289.4 Mean :1971
## 3rd Qu.: 8.00 3rd Qu.:2210 3rd Qu.: 556.0 3rd Qu.:1997
## Max. :13.00 Max. :9410 Max. :4820.0 Max. :2015
## yr_renovated zipcode lat long
## Min. : 0.00 Min. :98001 Min. :47.16 Min. :-122.5
## 1st Qu.: 0.00 1st Qu.:98033 1st Qu.:47.47 1st Qu.:-122.3
## Median : 0.00 Median :98065 Median :47.57 Median :-122.2
## Mean : 85.35 Mean :98078 Mean :47.56 Mean :-122.2
## 3rd Qu.: 0.00 3rd Qu.:98117 3rd Qu.:47.68 3rd Qu.:-122.1
## Max. :2015.00 Max. :98199 Max. :47.78 Max. :-121.3
## sqft_living15 sqft_lot15
## Min. : 460 Min. : 659
## 1st Qu.:1490 1st Qu.: 5100
## Median :1840 Median : 7639
## Mean :1986 Mean : 12826
## 3rd Qu.:2360 3rd Qu.: 10080
## Max. :6210 Max. :871200
head(train_data,3)
## id date price bedrooms bathrooms sqft_living sqft_lot
## 1: 9183703376 5/13/2014 225000 3 1.5 1250 7500
## 2: 464000600 8/27/2014 641250 3 2.5 2220 2550
## 3: 2224079050 7/18/2014 810000 4 3.5 3980 209523
## floors waterfront view condition grade sqft_above sqft_basement
## 1: 1 0 0 3 7 1250 0
## 2: 3 0 2 3 10 2220 0
## 3: 2 0 2 3 9 3980 0
## yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
## 1: 1967 0 98030 47.3719 -122.215 1260 7563
## 2: 1990 0 98117 47.6963 -122.393 2200 5610
## 3: 2006 0 98024 47.5574 -121.890 2220 65775
# Drop the identifier column
train_data$id <- NULL
# Move price to the last column by removing it and appending it again
price <- train_data$price
train_data$price <- NULL
train_data$price <- price
test_data <- fread('house_price_test.csv', stringsAsFactors = FALSE)
cat('The Dimensions of the Test set are: ', dim(test_data))
## The Dimensions of the Test set are: 4320 20
str(test_data)
## Classes 'data.table' and 'data.frame': 4320 obs. of 20 variables:
## $ id :integer64 6414100192 6054650070 16000397 2524049179 8562750320 7589200193 9547205180 1432701230 ...
## $ date : chr "12/9/2014" "10/7/2014" "12/5/2014" "8/26/2014" ...
## $ bedrooms : int 3 3 2 3 3 3 3 3 3 5 ...
## $ bathrooms : num 2.25 1.75 1 2.75 2.5 1 2.5 1 2.5 2.5 ...
## $ sqft_living : int 2570 1370 1200 3050 2320 1090 2300 1280 3160 3150 ...
## $ sqft_lot : int 7242 9680 9850 44867 3980 3000 3060 9656 13603 9134 ...
## $ floors : num 2 1 1 1 2 1.5 1.5 1 2 1 ...
## $ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
## $ view : int 0 0 0 4 0 0 0 0 0 0 ...
## $ condition : int 3 4 4 3 3 4 3 4 3 4 ...
## $ grade : int 7 7 7 9 8 8 8 6 8 8 ...
## $ sqft_above : int 2170 1370 1200 2330 2320 1090 1510 920 3160 1640 ...
## $ sqft_basement: int 400 0 0 720 0 0 790 360 0 1510 ...
## $ yr_built : int 1951 1977 1921 1968 2003 1929 1930 1959 2003 1966 ...
## $ yr_renovated : int 1991 0 0 0 0 0 2002 0 0 0 ...
## $ zipcode : int 98125 98074 98002 98040 98027 98117 98115 98058 98019 98056 ...
## $ lat : num 47.7 47.6 47.3 47.5 47.5 ...
## $ long : num -122 -122 -122 -122 -122 ...
## $ sqft_living15: int 1690 1370 1060 4110 2580 1570 1590 1340 3050 1990 ...
## $ sqft_lot15 : int 7639 10208 5095 20336 3980 5080 3264 8808 9232 9133 ...
## - attr(*, ".internal.selfref")=<externalptr>
head(test_data,3)
## id date bedrooms bathrooms sqft_living sqft_lot floors
## 1: 6414100192 12/9/2014 3 2.25 2570 7242 2
## 2: 6054650070 10/7/2014 3 1.75 1370 9680 1
## 3: 16000397 12/5/2014 2 1.00 1200 9850 1
## waterfront view condition grade sqft_above sqft_basement yr_built
## 1: 0 0 3 7 2170 400 1951
## 2: 0 0 4 7 1370 0 1977
## 3: 0 0 4 7 1200 0 1921
## yr_renovated zipcode lat long sqft_living15 sqft_lot15
## 1: 1991 98125 47.7210 -122.319 1690 7639
## 2: 0 98074 47.6127 -122.045 1370 10208
## 3: 0 98002 47.3089 -122.210 1060 5095
test_labels <- test_data$id
test_data$id <- NULL
The first step is to check whether the datasets contain missing values or duplicated rows so that they can be removed.
## The number of missing values on TRAIN are 0
## The number of missing values on TEST are 0
## The number of duplicated rows on TRAIN are 0
## The number of duplicated rows on TEST are 0
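The code behind these checks is not shown in the report. A minimal sketch of how such counts could be obtained in base R follows. The helper name data_quality and the toy data frame are assumptions made for illustration; they stand in for train_data and test_data.

```r
# Hypothetical helper: count missing cells and duplicated rows in a data frame.
data_quality <- function(df) {
  c(missing = sum(is.na(df)), duplicated = sum(duplicated(df)))
}

# Toy data (illustrative values only): one missing cell and one repeated row.
toy <- data.frame(
  bedrooms = c(3, 3, NA, 4),
  price    = c(225000, 225000, 810000, 330000)
)

quality <- data_quality(toy)
cat("The number of missing values are", quality[["missing"]], "\n")
cat("The number of duplicated rows are", quality[["duplicated"]], "\n")

# If anything were found, incomplete and repeated rows could be dropped like this.
clean <- unique(na.omit(toy))
```

On the real datasets both counts are 0, so the cleaning step would leave the data unchanged.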
As neither dataset has missing values or duplicated rows, the next step is to analyze the data with basic data visualization plots.
The target variable to predict is price, so a summary of this continuous variable and a distribution plot give an initial description.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 78000 320000 450000 539865 645500 7700000
The plot shows that the distribution of the target variable is right-skewed, with some outliers at the high end.
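The skew can also be checked numerically: in the summary above the mean (539865) is well above the median (450000), which is typical of a long right tail. The sketch below computes a moment-based skewness on a handful of prices taken from the summary and the first rows of the train set. The skewness helper is an assumption for illustration, not a function used in the report.

```r
# Hypothetical helper: moment-based sample skewness (positive means right-skewed).
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

# A few prices from the train set, including the maximum of 7700000.
toy_price <- c(225000, 330000, 450000, 530000, 641250, 7700000)

skew_raw <- skewness(toy_price)
cat("Skewness:", round(skew_raw, 2), "\n")
cat("Mean above median:", mean(toy_price) > median(toy_price), "\n")
```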
Using the coordinates of each house (longitude and latitude), a map was built to show where the data comes from (Seattle) and to separate the houses into clusters based on price ranges.
## Assuming "long" and "lat" are longitude and latitude, respectively
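The report does not show how the price ranges for the map were defined. One possible way, sketched here under the assumption that the quartiles from the price summary are used as break points, is to bin the prices with cut(). The labels and the choice of four groups are illustrative only.

```r
# Break points taken from the price summary: min, quartiles and max.
breaks <- c(78000, 320000, 450000, 645500, 7700000)

# Prices of the first five houses in the train set.
price <- c(225000, 641250, 810000, 330000, 530000)

# Assign each house to a price range; include.lowest keeps the minimum in the first bin.
price_range <- cut(
  price,
  breaks = breaks,
  labels = c("low", "mid-low", "mid-high", "high"),
  include.lowest = TRUE
)
print(table(price_range))
```

Each group could then be drawn as its own layer or colour on the map.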
One of the main analyses is to understand how prices evolved over the years and how many houses were built, in order to spot relations with time or with specific events that could have had an impact.
The data available is from 2014 and 2015. Looking at the first time series plot, prices over these two years are not stationary, which suggests that past data may help explain future values and that regression could be used to predict future prices.
The dataset also records the year in which each house was built. Using this information, two plots show how prices change with the year of construction and how many houses were sold. The curve shows prices declining and then recovering; this could be related to specific events of the 1960s, such as the Cold War.
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
All variables in the datasets are numerical except for the date. To see the distributions more clearly, the variables were split into discrete and continuous so that a suitable chart type could be used for each (bar charts or density charts).
Most of the distributions are skewed to the right.
Plotting price against the independent variables reveals outliers and possible relations between them, which gives initial insights for the feature engineering phase.
The graphs above clearly show outliers and suggest some correlation between the variables; this is checked in the next step with a proper correlation analysis.
A general correlation plot compares all the variables against the target and against each other. A smaller correlation plot, restricted to the high correlations, helps to understand which variables are most correlated and how to treat them later.
Highly correlated variables are usually considered to be those with a coefficient higher than 0.80, but in this case a threshold of 0.5 was used.
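The thresholding described above can be sketched in base R as follows. The data here is synthetic and only borrows column names from the dataset; the coefficients used to generate it are assumptions, so the resulting correlations are not those of the report.

```r
set.seed(1)
n <- 200

# Synthetic data: price depends on sqft_living, sqft_above closely tracks sqft_living,
# and noise is unrelated to everything else.
sqft_living <- rnorm(n, mean = 2000, sd = 500)
toy <- data.frame(
  price       = 250 * sqft_living + rnorm(n, sd = 50000),
  sqft_living = sqft_living,
  sqft_above  = 0.9 * sqft_living + rnorm(n, sd = 100),
  noise       = rnorm(n)
)

corr <- cor(toy)

# Keep the predictors whose absolute correlation with price exceeds the threshold.
threshold <- 0.5
price_corr <- corr["price", colnames(corr) != "price"]
strong <- names(price_corr[abs(price_corr) > threshold])
print(strong)

# Pairs of predictors that are highly correlated with each other are candidates
# for dropping one of the two.
print(round(corr["sqft_living", "sqft_above"], 2))
```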
The provided train and test datasets are combined so that the same feature engineering is applied to both.
A new variable, 'isTrain', makes it possible to split them again as they were originally given.
train <- train_data
test <- test_data
test$price <- NA
str(train)
## Classes 'data.table' and 'data.frame': 17277 obs. of 20 variables:
## $ date : chr "5/13/2014" "8/27/2014" "7/18/2014" "1/30/2015" ...
## $ bedrooms : int 3 3 4 4 4 4 4 3 4 3 ...
## $ bathrooms : num 1.5 2.5 3.5 1.5 1.75 3.5 3.25 2.25 2.5 1.5 ...
## $ sqft_living : int 1250 2220 3980 1890 1814 3120 4160 1440 2250 2540 ...
## $ sqft_lot : int 7500 2550 209523 7540 5000 5086 47480 10500 6840 9520 ...
## $ floors : num 1 3 2 1 1 2 2 1 2 1 ...
## $ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
## $ view : int 0 2 2 0 0 0 0 0 0 0 ...
## $ condition : int 3 3 3 4 4 3 3 3 3 3 ...
## $ grade : int 7 10 9 7 7 9 10 8 9 8 ...
## $ sqft_above : int 1250 2220 3980 1890 944 2480 4160 1130 2250 1500 ...
## $ sqft_basement: int 0 0 0 0 870 640 0 310 0 1040 ...
## $ yr_built : int 1967 1990 2006 1967 1951 2008 1995 1983 1987 1959 ...
## $ yr_renovated : int 0 0 0 0 0 0 0 0 0 0 ...
## $ zipcode : int 98030 98117 98024 98155 98115 98115 98072 98023 98058 98115 ...
## $ lat : num 47.4 47.7 47.6 47.8 47.7 ...
## $ long : num -122 -122 -122 -122 -122 ...
## $ sqft_living15: int 1260 2200 2220 1890 1290 1880 3400 1510 2480 1870 ...
## $ sqft_lot15 : int 7563 5610 65775 8515 5000 5092 40428 8125 7386 6800 ...
## $ price : num 225000 641250 810000 330000 530000 ...
## - attr(*, ".internal.selfref")=<externalptr>
str(test)
## Classes 'data.table' and 'data.frame': 4320 obs. of 20 variables:
## $ date : chr "12/9/2014" "10/7/2014" "12/5/2014" "8/26/2014" ...
## $ bedrooms : int 3 3 2 3 3 3 3 3 3 5 ...
## $ bathrooms : num 2.25 1.75 1 2.75 2.5 1 2.5 1 2.5 2.5 ...
## $ sqft_living : int 2570 1370 1200 3050 2320 1090 2300 1280 3160 3150 ...
## $ sqft_lot : int 7242 9680 9850 44867 3980 3000 3060 9656 13603 9134 ...
## $ floors : num 2 1 1 1 2 1.5 1.5 1 2 1 ...
## $ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
## $ view : int 0 0 0 4 0 0 0 0 0 0 ...
## $ condition : int 3 4 4 3 3 4 3 4 3 4 ...
## $ grade : int 7 7 7 9 8 8 8 6 8 8 ...
## $ sqft_above : int 2170 1370 1200 2330 2320 1090 1510 920 3160 1640 ...
## $ sqft_basement: int 400 0 0 720 0 0 790 360 0 1510 ...
## $ yr_built : int 1951 1977 1921 1968 2003 1929 1930 1959 2003 1966 ...
## $ yr_renovated : int 1991 0 0 0 0 0 2002 0 0 0 ...
## $ zipcode : int 98125 98074 98002 98040 98027 98117 98115 98058 98019 98056 ...
## $ lat : num 47.7 47.6 47.3 47.5 47.5 ...
## $ long : num -122 -122 -122 -122 -122 ...
## $ sqft_living15: int 1690 1370 1060 4110 2580 1570 1590 1340 3050 1990 ...
## $ sqft_lot15 : int 7639 10208 5095 20336 3980 5080 3264 8808 9232 9133 ...
## $ price : logi NA NA NA NA NA NA ...
## - attr(*, ".internal.selfref")=<externalptr>
train$isTrain <- 1
test$isTrain <- 0
house_base <- rbind(train,test)
str(house_base)
## Classes 'data.table' and 'data.frame': 21597 obs. of 21 variables:
## $ date : chr "5/13/2014" "8/27/2014" "7/18/2014" "1/30/2015" ...
## $ bedrooms : int 3 3 4 4 4 4 4 3 4 3 ...
## $ bathrooms : num 1.5 2.5 3.5 1.5 1.75 3.5 3.25 2.25 2.5 1.5 ...
## $ sqft_living : int 1250 2220 3980 1890 1814 3120 4160 1440 2250 2540 ...
## $ sqft_lot : int 7500 2550 209523 7540 5000 5086 47480 10500 6840 9520 ...
## $ floors : num 1 3 2 1 1 2 2 1 2 1 ...
## $ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
## $ view : int 0 2 2 0 0 0 0 0 0 0 ...
## $ condition : int 3 3 3 4 4 3 3 3 3 3 ...
## $ grade : int 7 10 9 7 7 9 10 8 9 8 ...
## $ sqft_above : int 1250 2220 3980 1890 944 2480 4160 1130 2250 1500 ...
## $ sqft_basement: int 0 0 0 0 870 640 0 310 0 1040 ...
## $ yr_built : int 1967 1990 2006 1967 1951 2008 1995 1983 1987 1959 ...
## $ yr_renovated : int 0 0 0 0 0 0 0 0 0 0 ...
## $ zipcode : int 98030 98117 98024 98155 98115 98115 98072 98023 98058 98115 ...
## $ lat : num 47.4 47.7 47.6 47.8 47.7 ...
## $ long : num -122 -122 -122 -122 -122 ...
## $ sqft_living15: int 1260 2200 2220 1890 1290 1880 3400 1510 2480 1870 ...
## $ sqft_lot15 : int 7563 5610 65775 8515 5000 5092 40428 8125 7386 6800 ...
## $ price : num 225000 641250 810000 330000 530000 ...
## $ isTrain : num 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, ".internal.selfref")=<externalptr>
New variables were created, such as the age of the house and whether or not it was renovated. Some features (date, latitude and longitude) will not be used as input for the models. Because of its high correlation with sqft_living (coefficient of 0.88), sqft_above is dropped as well.
##### **NEW FEATURES** #####
library(lubridate)
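# Age relative to the year in which the report is run (2019 in the output below)
house_base$houseAge <- year(Sys.time()) - house_base$yr_built
# Flag houses with a recorded renovation year (yr_renovated is 0 when never renovated)
house_base$renovated <- ifelse(house_base$yr_renovated == 0, 0, 1)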
#house_base[, c('date','sqft_above', 'latitude', 'longitude'):=NULL]
house_base[ ,c('date','sqft_above', 'lat', 'long')] <- list(NULL)
str(house_base)
## Classes 'data.table' and 'data.frame': 21597 obs. of 19 variables:
## $ bedrooms : int 3 3 4 4 4 4 4 3 4 3 ...
## $ bathrooms : num 1.5 2.5 3.5 1.5 1.75 3.5 3.25 2.25 2.5 1.5 ...
## $ sqft_living : int 1250 2220 3980 1890 1814 3120 4160 1440 2250 2540 ...
## $ sqft_lot : int 7500 2550 209523 7540 5000 5086 47480 10500 6840 9520 ...
## $ floors : num 1 3 2 1 1 2 2 1 2 1 ...
## $ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
## $ view : int 0 2 2 0 0 0 0 0 0 0 ...
## $ condition : int 3 3 3 4 4 3 3 3 3 3 ...
## $ grade : int 7 10 9 7 7 9 10 8 9 8 ...
## $ sqft_basement: int 0 0 0 0 870 640 0 310 0 1040 ...
## $ yr_built : int 1967 1990 2006 1967 1951 2008 1995 1983 1987 1959 ...
## $ yr_renovated : int 0 0 0 0 0 0 0 0 0 0 ...
## $ zipcode : int 98030 98117 98024 98155 98115 98115 98072 98023 98058 98115 ...
## $ sqft_living15: int 1260 2200 2220 1890 1290 1880 3400 1510 2480 1870 ...
## $ sqft_lot15 : int 7563 5610 65775 8515 5000 5092 40428 8125 7386 6800 ...
## $ price : num 225000 641250 810000 330000 530000 ...
## $ isTrain : num 1 1 1 1 1 1 1 1 1 1 ...
## $ houseAge : num 52 29 13 52 68 11 24 36 32 60 ...
## $ renovated : num 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, ".internal.selfref")=<externalptr>
head(house_base,3)
## bedrooms bathrooms sqft_living sqft_lot floors waterfront view
## 1: 3 1.5 1250 7500 1 0 0
## 2: 3 2.5 2220 2550 3 0 2
## 3: 4 3.5 3980 209523 2 0 2
## condition grade sqft_basement yr_built yr_renovated zipcode
## 1: 3 7 0 1967 0 98030
## 2: 3 10 0 1990 0 98117
## 3: 3 9 0 2006 0 98024
## sqft_living15 sqft_lot15 price isTrain houseAge renovated
## 1: 1260 7563 225000 1 52 0
## 2: 2200 5610 641250 1 29 0
## 3: 2220 65775 810000 1 13 0
All features except renovated, zipcode, yr_built and yr_renovated were centered and scaled before applying the models in the next stage.
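Centering and scaling replaces each value x with (x - mean) / sd, so that every transformed column has mean 0 and standard deviation 1. The sketch below shows this on the first five sqft_living values of the train set using base R only. Because it uses five values rather than the full column, the results differ from the scaled values printed in the report.

```r
# First five sqft_living values from the train set.
x <- c(1250, 2220, 3980, 1890, 1814)

# Manual standardization.
z <- (x - mean(x)) / sd(x)

# Same result with the base R scale() function.
z_scale <- as.numeric(scale(x))

print(round(z, 3))
```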
Outliers are kept in this case, but they could be removed in future improvements of the model.
The final dataset, ready to be used for modeling, is:
price <- house_base$price
isTrain <- house_base$isTrain
renovated <- house_base$renovated
yr_renovated <- house_base$yr_renovated
yr_built <- house_base$yr_built
zipcode <- house_base$zipcode
house_base[, c('price', 'isTrain','renovated', 'zipcode', 'yr_renovated','yr_built'):=NULL]
# NORMALIZING (preProcess() comes from the caret package)
num <- preProcess(house_base, method=c("center", "scale"))
house_base <- predict(num, house_base)
house_final <- cbind(house_base,renovated, zipcode, yr_built, yr_renovated, isTrain, price)
str(house_final)
## Classes 'data.table' and 'data.frame': 21597 obs. of 19 variables:
## $ bedrooms : num -0.403 -0.403 0.677 0.677 0.677 ...
## $ bathrooms : num -0.801 0.5 1.8 -0.801 -0.476 ...
## $ sqft_living : num -0.904 0.152 2.069 -0.207 -0.29 ...
## $ sqft_lot : num -0.184 -0.303 4.695 -0.183 -0.244 ...
## $ floors : num -0.916 2.79 0.937 -0.916 -0.916 ...
## $ waterfront : num -0.0872 -0.0872 -0.0872 -0.0872 -0.0872 ...
## $ view : num -0.306 2.304 2.304 -0.306 -0.306 ...
## $ condition : num -0.63 -0.63 -0.63 0.907 0.907 ...
## $ grade : num -0.561 1.996 1.144 -0.561 -0.561 ...
## $ sqft_basement: num -0.659 -0.659 -0.659 -0.659 1.306 ...
## $ sqft_living15: num -1.06 0.311 0.341 -0.141 -1.017 ...
## $ sqft_lot15 : num -0.19 -0.262 1.944 -0.156 -0.284 ...
## $ houseAge : num 0.136 -0.647 -1.191 0.136 0.681 ...
## $ renovated : num 0 0 0 0 0 0 0 0 0 0 ...
## $ zipcode : int 98030 98117 98024 98155 98115 98115 98072 98023 98058 98115 ...
## $ yr_built : int 1967 1990 2006 1967 1951 2008 1995 1983 1987 1959 ...
## $ yr_renovated : int 0 0 0 0 0 0 0 0 0 0 ...
## $ isTrain : num 1 1 1 1 1 1 1 1 1 1 ...
## $ price : num 225000 641250 810000 330000 530000 ...
## - attr(*, ".internal.selfref")=<externalptr>
summary(house_final)
## bedrooms bathrooms sqft_living sqft_lot
## Min. :-2.5620 Min. :-2.1012 Min. :-1.8629 Min. :-0.3520
## 1st Qu.:-0.4029 1st Qu.:-0.4757 1st Qu.:-0.7083 1st Qu.:-0.2429
## Median :-0.4029 Median : 0.1745 Median :-0.1855 Median :-0.1807
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6767 3rd Qu.: 0.4996 3rd Qu.: 0.5116 3rd Qu.:-0.1066
## Max. :31.9841 Max. : 7.6519 Max. :12.4819 Max. :39.5111
##
## floors waterfront view condition
## Min. :-0.91553 Min. :-0.0872 Min. :-0.3057 Min. :-3.7043
## 1st Qu.:-0.91553 1st Qu.:-0.0872 1st Qu.:-0.3057 1st Qu.:-0.6300
## Median : 0.01094 Median :-0.0872 Median :-0.3057 Median :-0.6300
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.93741 3rd Qu.:-0.0872 3rd Qu.:-0.3057 3rd Qu.: 0.9072
## Max. : 3.71682 Max. :11.4669 Max. : 4.9136 Max. : 2.4444
##
## grade sqft_basement sqft_living15 sqft_lot15
## Min. :-3.9703 Min. :-0.659 Min. :-2.3169 Min. :-0.44391
## 1st Qu.:-0.5608 1st Qu.:-0.659 1st Qu.:-0.7247 1st Qu.:-0.28079
## Median :-0.5608 Median :-0.659 Median :-0.2140 Median :-0.18839
## Mean : 0.0000 Mean : 0.000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.2916 3rd Qu.: 0.606 3rd Qu.: 0.5449 3rd Qu.:-0.09809
## Max. : 4.5534 Max. :10.230 Max. : 6.1634 Max. :31.47422
##
## houseAge renovated zipcode yr_built
## Min. :-1.4979 Min. :0.00000 Min. :98001 Min. :1900
## 1st Qu.:-0.8851 1st Qu.:0.00000 1st Qu.:98033 1st Qu.:1951
## Median :-0.1362 Median :0.00000 Median :98065 Median :1975
## Mean : 0.0000 Mean :0.04232 Mean :98078 Mean :1971
## 3rd Qu.: 0.6808 3rd Qu.:0.00000 3rd Qu.:98118 3rd Qu.:1997
## Max. : 2.4170 Max. :1.00000 Max. :98199 Max. :2015
##
## yr_renovated isTrain price
## Min. : 0.00 Min. :0.0 Min. : 78000
## 1st Qu.: 0.00 1st Qu.:1.0 1st Qu.: 320000
## Median : 0.00 Median :1.0 Median : 450000
## Mean : 84.46 Mean :0.8 Mean : 539865
## 3rd Qu.: 0.00 3rd Qu.:1.0 3rd Qu.: 645500
## Max. :2015.00 Max. :1.0 Max. :7700000
## NA's :4320
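After splitting the house_final dataset back into the original train and test sets, the labelled training data is split again: 75% is used to fit the models (train_new, 12957 rows) and 25% is held out to evaluate them (test_new, 4320 rows). The hold-out set happens to have the same number of rows as the original test set.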
train_model <- house_final[house_final$isTrain==1,]
test_model <- house_final[house_final$isTrain==0,]
smp_size <- floor(0.75 * nrow(train_model))
set.seed(123)
train_ind <- sample(seq_len(nrow(train_model)), size = smp_size)
train_new <- train_model[train_ind, ]
test_new <- train_model[-train_ind, ]
nrow(train_new)
## [1] 12957
nrow(test_new)
## [1] 4320
train_new$isTrain <- NULL
test_new$isTrain <- NULL
Before applying each model, a formula is defined to be used in all of them, since we want to predict price from all the remaining independent variables.
Some metrics are also defined to evaluate the outcome of each model and to pick the best one in a final comparison.
#### FORMULA
formula<-as.formula(price~.)
#METRICS
mape<-function(real,predicted){return(mean(abs((real-predicted)/real)))}
mae<-function(real,predicted){return(mean(abs(real-predicted)))}
rmse<-function(real,predicted){return(sqrt(mean((real-predicted)^2)))}
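As a quick worked example of the three metrics, the sketch below applies them to two made-up houses (the real and predicted values are illustrative, not taken from the models). Note that mape() returns a fraction, so it is multiplied by 100 when reported as a percentage.

```r
# Same metric definitions as above, repeated so the example is self-contained.
mape <- function(real, predicted) mean(abs((real - predicted) / real))
mae  <- function(real, predicted) mean(abs(real - predicted))
rmse <- function(real, predicted) sqrt(mean((real - predicted)^2))

# Illustrative values: each prediction is off by 10% of the real price.
real      <- c(200000, 400000)
predicted <- c(220000, 360000)

ex_mape <- mape(real, predicted)  # mean of 0.10 and 0.10
ex_mae  <- mae(real, predicted)   # mean of 20000 and 40000
ex_rmse <- rmse(real, predicted)  # sqrt of the mean of 4e8 and 16e8

cat("MAPE (%):", 100 * ex_mape, " MAE:", ex_mae, " RMSE:", round(ex_rmse, 2), "\n")
```

RMSE penalizes the larger error more heavily than MAE, which is why it comes out higher than MAE here.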
##
## Call: glmnet(x = data.matrix(train_new[, !"price"]), y = train_new[["price"]], family = "gaussian", alpha = 1, lambda = lasso_cv$lambda.min)
##
## Df %Dev Lambda
## [1,] 17 0.6657 598.4
## 17 x 1 sparse Matrix of class "dgCMatrix"
## s0
## bedrooms -28227.731784
## bathrooms 28556.726059
## sqft_living 137820.947906
## sqft_lot -212.101496
## floors 17427.876046
## waterfront 46832.914208
## view 34754.358019
## condition 14412.895924
## grade 145345.810730
## sqft_basement 5265.698159
## sqft_living15 16770.062405
## sqft_lot15 -14116.564072
## houseAge 101720.540892
## renovated 26464.991208
## zipcode 14.455318
## yr_built -10.511604
## yr_renovated 4.099015
## RMSE MAE MAPE
## [1,] 231312.4 139965 29.1
## RMSE MAE MAPE
## [1,] 184895.8 100685.6 20
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + floors +
## waterfront + view + condition + grade + sqft_basement + sqft_living15 +
## sqft_lot15 + houseAge + renovated + yr_renovated, data = train_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1223760 -108376 -11288 90776 4371207
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 538276.4 1872.6 287.456 < 2e-16 ***
## bedrooms -29969.3 2274.0 -13.179 < 2e-16 ***
## bathrooms 28785.0 3368.7 8.545 < 2e-16 ***
## sqft_living 138529.9 4391.1 31.548 < 2e-16 ***
## floors 18943.1 2529.0 7.490 7.32e-14 ***
## waterfront 47567.6 2012.6 23.635 < 2e-16 ***
## view 34869.5 2150.4 16.215 < 2e-16 ***
## condition 15389.0 2018.9 7.622 2.66e-14 ***
## grade 144732.0 3312.7 43.691 < 2e-16 ***
## sqft_basement 5549.5 2475.9 2.241 0.025 *
## sqft_living15 17699.7 3048.3 5.806 6.53e-09 ***
## sqft_lot15 -15144.4 2005.5 -7.552 4.59e-14 ***
## houseAge 103520.3 2597.5 39.854 < 2e-16 ***
## renovated -4991309.2 1147050.4 -4.351 1.36e-05 ***
## yr_renovated 2518.7 574.7 4.383 1.18e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 208000 on 12942 degrees of freedom
## Multiple R-squared: 0.6663, Adjusted R-squared: 0.6659
## F-statistic: 1845 on 14 and 12942 DF, p-value: < 2.2e-16
## RMSE MAE MAPE
## [1,] 230779.8 139745.5 29
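The fitting code for this linear model is not shown. The reduced formula in the call above is consistent with stepwise feature selection, which can be sketched with step() from the stats package. The original report may have used a different function (for example stepAIC from MASS); the synthetic data and its coefficients below are assumptions for illustration.

```r
set.seed(42)
n <- 300

# Synthetic, already standardized predictors; only two of them drive the price.
toy <- data.frame(
  sqft_living = rnorm(n),
  grade       = rnorm(n),
  noise       = rnorm(n)
)
toy$price <- 500000 + 140000 * toy$sqft_living + 145000 * toy$grade +
  rnorm(n, sd = 50000)

# Start from the full model and let AIC-based stepwise selection add or drop terms.
full_model <- lm(price ~ ., data = toy)
step_model <- step(full_model, direction = "both", trace = 0)

# Predictors that survived the selection.
kept <- attr(terms(step_model), "term.labels")
print(kept)
```

Stepwise selection keeps the informative predictors; an uninformative one such as noise is usually, but not always, dropped.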
## [1] train-rmse:477483.093750
## [2] train-rmse:359726.156250
## [3] train-rmse:281496.093750
## [4] train-rmse:228993.453125
## [5] train-rmse:196445.343750
## [6] train-rmse:174051.703125
## [7] train-rmse:159525.312500
## [8] train-rmse:150715.093750
## [9] train-rmse:140099.703125
## [10] train-rmse:135083.375000
## [11] train-rmse:131570.000000
## [12] train-rmse:128801.375000
## [13] train-rmse:126498.015625
## [14] train-rmse:122762.968750
## [15] train-rmse:119967.203125
## [16] train-rmse:117354.500000
## [17] train-rmse:116233.812500
## [18] train-rmse:112307.132812
## [19] train-rmse:111647.343750
## [20] train-rmse:109219.843750
## [21] train-rmse:108185.234375
## [22] train-rmse:106265.015625
## [23] train-rmse:104316.257812
## [24] train-rmse:103114.523438
## [25] train-rmse:102056.515625
## [26] train-rmse:100778.070312
## [27] train-rmse:99862.625000
## [28] train-rmse:98468.093750
## [29] train-rmse:97836.898438
## [30] train-rmse:97140.187500
## [31] train-rmse:96321.273438
## [32] train-rmse:95929.156250
## [33] train-rmse:94263.242188
## [34] train-rmse:92995.500000
## [35] train-rmse:91637.539062
## [36] train-rmse:90815.960938
## [37] train-rmse:90135.125000
## [38] train-rmse:89385.468750
## [39] train-rmse:88445.664062
## [40] train-rmse:86862.632812
## [41] train-rmse:86106.320312
## [42] train-rmse:85549.445312
## [43] train-rmse:85265.031250
## [44] train-rmse:84441.882812
## [45] train-rmse:83766.250000
## [46] train-rmse:82960.976562
## [47] train-rmse:82489.421875
## [48] train-rmse:81461.023438
## [49] train-rmse:81105.289062
## [50] train-rmse:80669.539062
## [51] train-rmse:80058.250000
## [52] train-rmse:79767.085938
## [53] train-rmse:78976.695312
## [54] train-rmse:78341.156250
## [55] train-rmse:77761.132812
## [56] train-rmse:77004.453125
## [57] train-rmse:76689.328125
## [58] train-rmse:76288.867188
## [59] train-rmse:75589.890625
## [60] train-rmse:74811.695312
## RMSE MAE MAPE
## [1,] 156843.4 79998.87 15.1
The MAPE (Mean Absolute Percentage Error) is the main metric used to compare the results.
## method rmse mae mape
## 1: glmnet 231312.4 139964.95 0.2906808
## 2: rf 184895.8 100685.59 0.2003794
## 3: lm 230779.8 139745.52 0.2900942
## 4: xgb 156843.4 79998.87 0.1508053
With the lowest MAPE, the model used to predict the house prices is **XGBoost Trees**.
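The selection can be reproduced from the comparison table with a few lines of base R. The data frame below simply re-enters the values reported above; the variable names results, ranked and best are chosen for this sketch.

```r
# Metrics copied from the comparison table.
results <- data.frame(
  method = c("glmnet", "rf", "lm", "xgb"),
  rmse   = c(231312.4, 184895.8, 230779.8, 156843.4),
  mae    = c(139964.95, 100685.59, 139745.52, 79998.87),
  mape   = c(0.2906808, 0.2003794, 0.2900942, 0.1508053),
  stringsAsFactors = FALSE
)

# Rank the models from lowest to highest MAPE and take the first one.
ranked <- results[order(results$mape), ]
best <- ranked$method[1]
print(ranked)
cat("Best model by MAPE:", best, "\n")
```

In this table the ranking is the same for RMSE and MAE, so the choice does not depend on which of the three metrics is used.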
Finally, the prices for all the ids stored in test_labels are predicted and written to a .txt file.
pred <- data.frame(id=test_labels,price=round(df_predicted$test_xgb))
write.table(pred,file="House_Price_Pred_1.txt",row.names=F, sep = ',')
House prices were predicted using a machine learning process in which the data was first analyzed and presented visually to gain insights. The data was then cleaned, transformed and split, and used to build different models: Random Forest, Lasso Regression, Regression with Stepwise Feature Selection and XGBoost Trees. The last one had the lowest MAPE, so it was the final model used to predict the prices for the test dataset.
As future improvements, outliers could be removed and the model parameters could be tuned to see whether the results improve.